Global Address Space, Non-Uniform Bandwidth: A Memory System Performance Characterization of Parallel Systems
Authors
Abstract
Many parallel systems offer a simple view of memory: all storage cells are addressed uniformly. Despite this uniform view of memory, the machines differ significantly in their memory system performance (and may offer slightly different consistency models). Cached and local memory accesses are much faster than remote reads of data generated by another processor or remote writes of data intentionally pushed to memories close to another processor. The bandwidth from/to cache and local memory can be an order of magnitude (or more) higher than the bandwidth to/from remote memory. The situation is further complicated by the heavy influence of the access pattern (i.e., the spatial locality of reference) on both the local and the remote memory system bandwidth. On these modern machines, a compiler for a parallel system is faced with a number of options for accomplishing a data transfer most efficiently. Choosing the best option requires a cost-benefit model obtained from an empirical evaluation of the memory system performance. We evaluate three DEC Alpha-based parallel systems to demonstrate the practicality of this approach. The common DEC Alpha processor architecture facilitates a direct comparison of memory system performance. These systems are the DEC 8400, the Cray T3D, and the Cray T3E. The three systems differ in their clock speed, their scalability, and the amount of coherency they provide. The DEC 8400 is a shared-memory symmetric multiprocessor based on a high-speed bus offering sequential consistency; the Cray T3D and T3E are scalable multicomputers based on a scalable 3D torus interconnect that either do not cache remote accesses at all (T3E) or provide only partial memory consistency within a node (T3D), and therefore typically leave consistency to the application or compiler.
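As a rough illustration of the empirical approach described above, the following Python sketch times a gather (strided loads, contiguous stores) at several strides and then selects the option with the highest observed bandwidth. It is not the authors' benchmark: the function names `strided_copy_bandwidth` and `pick_best`, the buffer sizes, and the stride set are all assumptions made for this example, and interpreter overhead means the numbers only indicate trends on the local machine, not absolute memory system performance.

```python
import time
from array import array

WORD_BYTES = 8  # one 64-bit word, matching array type code 'd'


def strided_copy_bandwidth(n_words, stride, repeats=5):
    """Best observed bytes/s when gathering n_words words read at `stride`.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Source buffer large enough to hold n_words elements spaced `stride` apart.
    src = array("d", bytes(WORD_BYTES * n_words * stride))
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        dst = src[::stride]  # strided loads, contiguous stores
        elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
        assert len(dst) == n_words
        best = max(best, n_words * WORD_BYTES / elapsed)
    return best


def pick_best(measured):
    """Return the option name with the highest measured bandwidth."""
    return max(measured, key=measured.get)


if __name__ == "__main__":
    # Build a small table of measurements, one entry per access pattern.
    table = {
        f"stride_{s}": strided_copy_bandwidth(1 << 14, s)
        for s in (1, 2, 8, 64)
    }
    for name, bw in table.items():
        print(f"{name}: {bw / 1e6:.1f} MB/s")
    print("highest-bandwidth option:", pick_best(table))
```

A compiler-oriented cost model would replace the single "highest bandwidth" rule with a lookup keyed by transfer method and access pattern, but the structure is the same: measure first, then decide from the table.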
Our performance characterization shows that although the clock rate of the DEC 8400 is double that of the Cray T3D, the DEC 8400 offers only modest improvements in the performance of remote memory operations over the Cray T3D. The local and remote memory system performance of the Cray T3E matches the doubled clock speed of the processor.
This research was sponsored in part by the Advanced Research Projects Agency (ITO), monitored by SPAWAR under contract N00039-93-C-0152. T. Stricker's current address: Institut für Computer Systeme, ETH Zürich, Switzerland. Copyright 1997 IEEE. Published in the Proceedings of the Third International Symposium on High Performance Computer Architecture, February 1-5, 1997, in San Antonio, Texas, USA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.
Related resources
Designing Multisocket Systems with Silicon Photonics
To fuel an increasing need for parallel performance, system designers have resorted to using multiple sockets to provide more hardware parallelism. These multisocket systems have limited off-chip bandwidth due to their electrical interconnect, which is both power and pin limited. Current systems often use a Non-Uniform Memory Architecture (NUMA) to get the most system memory bandwidth from li...
Using Memory-Mapped Network Interfaces to Improve the Performance of Distributed Shared Memory
Shared memory is widely believed to provide an easier programming model than message passing for expressing parallel algorithms. Distributed Shared Memory (DSM) systems provide the illusion of shared memory on top of standard message passing hardware at very low implementation cost, but provide acceptable performance for only a limited class of applications. We argue that the principal sources ...
Data locality optimization of shared memory programs on NUMA architectures using an integrated tool environment
Due to their excellent price-performance ratio, clusters built from commodity nodes have become broadly adopted and increasingly popular as platforms for parallel processing. Among them, the clusters of standard PCs interconnected with high-speed system area networks (SANs) are especially attractive and have been widely established. At the same time, the developments in interconnection technolo...
Implementing a Global Address Space Language on the Cray X1: the Berkeley UPC Experience
The Berkeley UPC Compiler is an open source, high performance and portable implementation of Unified Parallel C (UPC), an SPMD global-address space language extension of ISO C. In previous work, we have experimented with our compiler on a variety of high-performance networks and parallel architectures, including distributed memory machines and clusters of SMPs. Our goal in this paper is to implement...
A HyperTransport-Enabled Global Memory Model For Improved Memory Efficiency
Modern and emerging data centers are presenting unprecedented demands in terms of cost and energy consumption, far outpacing architectural advances related to economies of scale. Consequently, blade designs exhibit significant cost and power inefficiencies, particularly in the memory system. For example, we observe that modern blades are often overprovisioned to accommodate peak memory demand w...